Abstract

Background: Privacy protection is an important issue in medical informatics, and differential privacy is a state-of-the-art
framework for data privacy research. Differential privacy offers provable privacy against attackers who have
auxiliary information, and can be applied to data mining models (for example, logistic regression). However,
differentially private methods sometimes introduce too much noise and make outputs less useful. Given available
public data in medical research (e.g., from patients who sign open-consent agreements), we can design algorithms
that use both public and private data sets to decrease the amount of noise that is introduced.

Methodology: In this paper, we modify the update step of the Newton-Raphson method to propose a differentially
private distributed logistic regression model based on both public and private data.

Experiments and results: We evaluate our algorithm on three different data sets and show its advantage over (1) a
logistic regression model based solely on public data, and (2) a differentially private distributed logistic regression
model based on private data under various scenarios.

Conclusion: Logistic regression models built with our new algorithm based on both private and public data sets
demonstrate better utility than models trained on private or public data sets alone, without sacrificing the
rigorous privacy guarantee.



Introduction 

Data about individuals are being collected at an unpre- 
cedented speed, which brings new opportunities for 
scientific discovery and healthcare quality improvement. 
In the meantime, there is increasing concern about peo- 
ple's privacy and inappropriate disclosure of sensitive 
information [1]. This problem is especially challenging 
in biomedicine [2], where information sharing is one of 
the biggest pillars to facilitate meaningful analysis of 
complex medical data. For example, classifying complex 
or rare patterns in clinical and genomic data requires 
the availability of a large, labeled patient set, which 
needs to be obtained from multiple institutions [3]. 

Any data access mechanism involves a tradeoff between 
the privacy risk and the data utility. In biomedicine, data
custodians can change the content of the data to make it
more difficult for attackers to re-identify individuals
(k-anonymity [4], l-diversity [5], t-closeness [6], etc.) or
can perturb the outputs of a query to ensure
"indistinguishability" of individuals (e.g., count queries
satisfying differential privacy [7]). Because differential
privacy [8] provides a provable guarantee and is immune 
to attacks with auxiliary information, it is acknowledged 
as a state-of-the-art privacy definition [9]. The perfor- 
mance (i.e., privacy and utility) of a differentially private 
method is highly dependent on the nature of the applica- 
tion and the capability of the protection mechanism. To 
meet the need of different applications, many customized 
differentially private methods, including decision trees 
[10], logistic regression [11], principal components analy- 
sis [12], multi-class Gaussian classifiers [13], have been 
developed. There are several recent efforts in integrating 
the differential privacy framework into the system design 
and case studies for statistical health information release
[14], [15]. However, due to an inherent challenge of
differential privacy in considering the entire sample space,
the level of data perturbation often increases too quickly
when the privacy assurance becomes stronger, which
ends up adding too much noise [16], i.e., producing
useless, albeit protected, data.

We believe the situation can be alleviated in an envir- 
onment where both public and private data sets for the 
same study are available for analysis. This is useful in 
biomedical research (e.g., randomized clinical trial), 
where some patients are willing to sign an open-consent 
agreement to make their data (publicly) available for 
research, while other patients prefer to limit disclosure 
to a single institution. Our idea is to develop hybrid 
data mining models using both public and private data 
sets in a differentially private and distributed manner to 
achieve improved utility of the disclosed data. We will 
focus on the logistic regression model, which is one of 
the most popular approaches in biomedicine, to develop 
a distributed and privacy preserving solution in the 
healthcare context. 

Related work 

Our model is closely related to Grid Logistic Regression, 
a model developed by Wu [3]. Their model is based on 
a distributed Newton-Raphson algorithm; however, it
does not consider privacy risk during the exchange of 
aggregated statistics among participants. A recent work 
by Wu et al. [17] discusses institutional privacy of dis- 
tributed logistic regression and introduces a secure-sum 
based approach to protect aggregated statistics using a 
trusted server, but it does not meet the differential priv- 
acy criteria. The underlying intuition of our model is 
close to Elkan's work [18] to represent a confidential 
database via importance weighting elements of a public 
database for general data mining purpose, but his 
approach is also not differentially private. 

There are also some previous works on differentially pri- 
vate distributed learning. For example, Pathak, Rane, Raj 
et al. [19] suggested running local models on each private 
data set and aggregating estimated parameters. This 
approach is different from ours in two aspects. First, their 
model does not take public data into consideration. Sec- 
ond, they only prove that the final outputs (the aggregated 
parameters) are differentially private, while there is no 
guarantee that the intermediary outputs from individual 
private data sets (which need to be shared during the pro- 
cess) are differentially private. In contrast, our model 
ensures differential privacy for all steps. Rajkumar and 
Agarwal [20] recently proposed a distributed differentially 
private stochastic gradient descent algorithm, which also 
differs from ours as follows: (1) their approach uses only
private data; (2) their approach is $(\epsilon, \delta)$-differentially private
(weaker) while ours is $\epsilon$-differentially private (stronger).



In this paper, we introduce a new distributed logistic 
regression model that runs on many data sets, e.g., both 
public and private ones. It treats these two kinds of 
data sets differently: it leverages public data sets to 
improve utility while protecting the private data sets. 
The Background section introduces some background
knowledge, and the Methodology section elaborates on the
details of our method. In the Experiments and results
section, we compare our model to other approaches, and
explore the impact of different settings (i.e., the fraction 
of data that are public, the number of distributed 
private data sets, and impact of the regularization 
parameter) on the final model. Finally, discussions and 
conclusions are drawn. 

Background 

In this Section, we will briefly review techniques related 
to this article. 

Differential privacy 

Differential privacy (DP) is a privacy definition proposed 
by Dwork, Kenthapadi, McSherry et al [8], which states 
that any answer to a query based on a private data set 
should not be altered dramatically with the change of a 
single record in the data set. 

In the following part of this paper, D and D' always 
differ on at most one sample (i.e., D and D' are "neigh- 
bors" with only one sample replaced). 

Definition 1: Query function. A function $f : D \to \mathbb{R}^p$ is
a query function if it is a projection from a data set $D$
(not a single sample, but the whole data set) to $\mathbb{R}^p$.

Many data mining models can be viewed as a query 
function, for example, the coefficients of the logistic 
regression can be seen as the projection of a data set to 
a real-valued vector. 

Definition 2: $\epsilon$-Differential Privacy. A randomized
algorithm (or mechanism) $f$ is $\epsilon$-differentially private if,
for any neighbors $D$ and $D'$ and for any $S \subseteq \mathbb{R}^p$ for which the
following probabilities are well-defined,

$$\Pr[f(D) \in S] \le e^{\epsilon} \times \Pr[f(D') \in S],$$

where the probabilities reflect the randomness of the
algorithm $f$. The parameter $\epsilon$ is called the privacy
budget. The smaller $\epsilon$ is, the better privacy is preserved,
and vice versa.

Definition 3: Sensitivity. A query function $f$'s sensitivity
under norm $\|\cdot\|$ is defined by

$$S_{f,\|\cdot\|} = \max_{D, D'} \|f(D) - f(D')\|.$$

Definition 4: Laplacian mechanism [21]. For any
query function $f : D \to \mathbb{R}^p$, the algorithm that returns

$$\tilde{f}(D) = f(D) + \delta, \quad \text{where } \delta \text{ has density } p(\delta) \propto \exp\left(-\frac{\epsilon \|\delta\|}{S_{f,\|\cdot\|}}\right),$$

satisfies $\epsilon$-differential privacy.

If two independent mechanisms are $\epsilon_1$- and $\epsilon_2$-differentially
private, running them sequentially on the same data
set will consume a privacy budget of $\epsilon_1 + \epsilon_2$, which is
known as the sequential composition property of
differential privacy [22].

Typically, the L1 norm is used in calculating sensitivity and
applying the Laplacian mechanism, in which case the noise
on different dimensions is independent. In this paper,
we consider the L2 norm instead, such that the sensitivity
of the penalized logistic regression parameters (i.e., the
outputs of a query function) can be bounded, as proved in
Corollary 2 by Chaudhuri et al. [11]. The L2 norm has
been used in previous differentially private algorithms, e.g.,
the work of Chaudhuri [11] and Rajkumar [20].
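
As an illustration of the Laplacian mechanism under the L2 norm, the following minimal sketch draws a noise vector whose density is proportional to exp(-epsilon * ||delta||_2 / sensitivity). It relies on the fact that such a vector has a uniformly distributed direction and a Gamma-distributed norm with shape p and scale sensitivity / epsilon. The function names `sample_l2_laplace` and `laplacian_mechanism` are hypothetical and are not taken from the paper.

```python
import numpy as np


def sample_l2_laplace(p, epsilon, sensitivity, rng):
    """Draw one vector in R^p with density proportional to
    exp(-epsilon * ||delta||_2 / sensitivity)."""
    # Direction: uniform on the unit sphere (a normalized Gaussian vector).
    direction = rng.standard_normal(p)
    direction /= np.linalg.norm(direction)
    # Radius: the L2 norm follows Gamma(shape=p, scale=sensitivity / epsilon).
    radius = rng.gamma(shape=p, scale=sensitivity / epsilon)
    return radius * direction


def laplacian_mechanism(query_value, epsilon, sensitivity, rng):
    """Return the (1-D) query output plus L2-norm Laplacian noise."""
    query_value = np.asarray(query_value, dtype=float)
    noise = sample_l2_laplace(query_value.size, epsilon, sensitivity, rng)
    return query_value + noise
```

A smaller privacy budget or a larger sensitivity increases the scale of the Gamma radius, and therefore the expected magnitude of the noise.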

Newton method for logistic regression 

The Newton method (also known as the Newton-Raphson
method [23]) is an iterative approach that uses the gradient to
find roots of a real-valued differentiable function. Since a
function's extrema are also roots of its gradient, the
Newton method can also be used to find the extrema of a
twice differentiable function. Due to its efficiency in handling
convex functions (i.e., usually only a few iterations (five or
six) are needed to reach a very high precision [24]), the
Newton method is a popular numerical approach for
building a logistic regression model [25]. Given the
log-likelihood function of a logistic regression model $L(\beta)$, the
Newton method approaches the maximum likelihood
coefficient estimate $\hat{\beta}$ with the following steps:

Initialize $\beta_0 = 0$.

Compute the gradient and Hessian matrix of $L(\beta)$,

$$\mathrm{grad} = \frac{\partial L(\beta)}{\partial \beta}, \qquad H = \frac{\partial^2 L(\beta)}{\partial \beta^2}.$$

Update $\beta_0 = \beta_0 - H^{-1} \mathrm{grad}$ and repeat the second step
until $\beta_0$ converges.
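
The steps above can be sketched in a few lines of NumPy. This is a minimal, non-private illustration that assumes labels coded as +1/-1 and an optional L2 penalty `lam`; the function names and the stopping tolerance are hypothetical choices, not part of the original method description.

```python
import numpy as np


def log_likelihood_grad_hess(beta, X, y, lam=0.0):
    """Gradient and Hessian of
    sum_i -log(1 + exp(-y_i * x_i^T beta)) - lam / 2 * ||beta||^2."""
    margin = y * (X @ beta)
    # s equals 1 / (1 + exp(margin)), written with tanh to avoid overflow.
    s = 0.5 * (1.0 - np.tanh(margin / 2.0))
    grad = X.T @ (y * s) - lam * beta
    w = s * (1.0 - s)
    hess = -(X.T * w) @ X - lam * np.eye(X.shape[1])
    return grad, hess


def newton_logistic(X, y, lam=0.0, max_iter=25, tol=1e-8):
    """Newton-Raphson iterations starting from beta = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad, hess = log_likelihood_grad_hess(beta, X, y, lam)
        step = np.linalg.solve(hess, grad)  # H^{-1} grad
        beta = beta - step
        if np.linalg.norm(step) < tol:
            break
    return beta
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of the Hessian explicitly.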

Methodology 

Our goal is to develop a distributed logistic regression 
model that effectively synthesizes data (public and pri- 
vate) across different sites (institutions) in a differentially 
private manner. 

Assumptions about the data set 

To develop the model in the biomedical context, we
make some assumptions about the data sets. First, the
number of samples in each data set is not very large. This
assumption is reasonable, as otherwise one data set would be
enough to build the model and additional data sets would not
bring much benefit. Second, the size of the public data set
is significantly smaller than that of the private data sets.
The reason is that by default biomedical data should be
kept secure and private unless patients are willing to sign
open-consent agreements to make their medical data
available for research, which only applies to a small
percentage of the total data. Third, we assume samples in
different data sets follow the same predictive rule, which
means $P(label \mid predictors)$ must be the same across all data
sets. This is necessary for constructing a distributed
logistic regression model that can provide useful information
in biomedical research. In practice, such an assumption can
be verified by checking the goodness-of-fit (e.g., the
Hosmer-Lemeshow test [26]) of the fitted local and global models
without sacrificing individual privacy. Last, we assume that
the distributions of samples in different data sets
($P(predictors)$) are similar, although this assumption will
be relaxed, as we will elaborate in the discussion section.

Notation

All samples are independent and identically distributed
(i.i.d.), and each sample has a binary label
$y \in \{1, -1\}$ and a predictor vector $x$. The $k$ private data
sets (namely $D_1, \ldots, D_k$) have $n_1, \ldots, n_k$ samples,
respectively, and the $i$-th sample in the $j$-th data set is denoted
as $(x_i^j, y_i^j)$. The public data set $D_0$ has samples
$(x_i^0, y_i^0)$, $i = 1, \ldots, n_0$. The penalized logistic regression
maximizes the following log-likelihood function

$$L(\beta) = -\sum_{j=0}^{k} \sum_{i=1}^{n_j} \log\left(1 + \exp(-y_i^j \beta^T x_i^j)\right) - \frac{\lambda}{2} \|\beta\|_2^2,$$

and the $\beta$ that maximizes this log-likelihood function
is the estimated parameter for the model.

Note that, for the sensitivity to be bounded, if an intercept is
introduced into the predictors, the corresponding
parameter should also be regularized.
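
For reference, the derivatives used by the Newton method can be written out explicitly. The worked equations below assume that the penalized log-likelihood takes the standard form for labels in {1, -1} with an L2 penalty of strength $\lambda$; they are a sketch under that assumption rather than a restatement of the original derivation.

```latex
\begin{align*}
L(\beta) &= -\sum_{j=0}^{k} \sum_{i=1}^{n_j}
  \log\left(1 + \exp(-y_i^j \beta^T x_i^j)\right)
  - \frac{\lambda}{2} \|\beta\|_2^2, \\
\frac{\partial L(\beta)}{\partial \beta} &= \sum_{j=0}^{k} \sum_{i=1}^{n_j}
  \frac{y_i^j x_i^j}{1 + \exp(y_i^j \beta^T x_i^j)} - \lambda \beta, \\
\frac{\partial^2 L(\beta)}{\partial \beta^2} &= -\sum_{j=0}^{k} \sum_{i=1}^{n_j}
  \frac{\exp(y_i^j \beta^T x_i^j)}{\left(1 + \exp(y_i^j \beta^T x_i^j)\right)^2}
  \, x_i^j (x_i^j)^T - \lambda I .
\end{align*}
```

Each summand of the gradient has an L2 norm of at most $\|x_i^j\|_2$, because the scalar factor $1 / (1 + \exp(\cdot))$ lies in $(0, 1)$ and $|y_i^j| = 1$. Replacing one sample therefore changes a data set's gradient sum by at most twice the upper bound on the predictors' L2 norm.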

Method description 

Because the logistic regression model has no analytic 
solutions, we need to solve it using numerical methods 
like the Newton algorithm (see the Background section), 
which involves several iterations of optimization. As the 
distributed logistic regression model is supposed to be 
trained on multiple data sets, we need to implement a 
privacy-preserving information exchange mechanism to 
transmit intermediary results across private data sets. In 
addition, the privacy budget is limited and it must be 
split across iterations. To maintain data utility, we must 
balance the number of iterations and the privacy budget 
spent on each iteration in the training process. 

To use the Newton-Raphson algorithm, we need to
compute the gradient and the Hessian matrix from the
data sets. The simplest way to ensure differential privacy
is to add Laplacian noise to the gradient and the Hessian
matrix, and use the noisy versions of these intermediary
results to update parameters. Theoretically, the impact
of additive noise in the Laplacian mechanism (see Definition
4) becomes much smaller as the number of samples
approaches infinity, as the sensitivity of the gradient and
the Hessian matrix is independent of the size of a data set.
In reality, however, the effects of noise on the gradient
and the Hessian matrix (for parameter estimation) are
quite different. For example, the gradient is usually less
affected by the additive noise than is the Hessian matrix.
This is because the gradient has a linear impact on the
parameter updates. However, this is not the case for the
Hessian matrix. Since the inverse of the Hessian matrix is
used in the update step, even a little noise in the Hessian
matrix can lead to large changes in the parameters being
updated. Such changes can become very large when the
noise destroys the Hessian matrix's positive definiteness,
which implies that a global optimal solution (like the one
for the log-likelihood function of a traditional logistic
regression model) may not be attained. Although we can
threshold the eigenvalues of the Hessian matrix to ensure
positive definiteness, this method might generate useless
coefficients. Therefore, the key to building a useful
differentially private distributed logistic regression using the
Newton-Raphson algorithm is to reduce the noise in the
Hessian matrix, and especially to reduce the chance of a
non-positive definite Hessian matrix.

Our approach is to leverage public data sets (i.e., those
contributed by patients who signed the open-consent
agreement). In our hybrid framework, the Hessian matrix is
estimated solely using the public data set, and we use
public and private data sets to compute the gradient.
Estimating the Hessian matrix using only public data has
the following advantages: (1) The Hessian matrix
from the public data set is positive definite. Therefore,
the worst case discussed above is avoided even though
the absolute sampling error of this Hessian matrix might
be larger than the error of the noisy Hessian matrix from
individual data sets. (2) The sensitivity of the Hessian matrix is $O(p^2)$
when there are $p$ features in each sample, while the
sensitivity of the gradient is only $O(p)$. Therefore, if we use the
public data to calculate the Hessian matrix, the total
sensitivity can be reduced from $O(p^2)$ to $O(p)$, which
increases the accuracy of update steps.

Unlike the traditional Newton-Raphson algorithm, which
iterates until convergence, our method uses a fixed number
of iterations for the following reasons. First, the original
Newton method stops after parameters converge, but our
algorithm would never terminate due to the noise added in
each step. Second, the stop decision in our case cannot be
accurately determined by comparing the likelihoods
associated with $\beta_{new}$ and $\beta_{old}$ because it is possible that the
noisy $\beta_{new}$ performs worse than $\beta_{old}$, which will never
happen in the standard maximum likelihood estimation.
Finally, a fixed number of iterations allows us to allocate
privacy budgets easily, e.g., evenly split across iterations as
in this work.

The details of our update step are illustrated in Algo- 
rithm 1, and the full model is described in Algorithm 2. 

There are some modifications that can further
improve the performance. Please refer to Additional
file 1 for these modifications and for the proof of
differential privacy.

Algorithm 1

Modified update step in distributed logistic regression

Input

Private data sets $D_1, \ldots, D_k$, public data set $D_0$, privacy
budget for this iteration $\epsilon_0$, coefficient of penalty $\lambda$, the upper
bound of the L2 norm of samples $M$, and old parameter $\beta_{old}$
obtained from the previous iteration.

Output

Logistic regression parameter $\beta_{new}$.

1: Compute the Hessian matrix $H$ using only the public
data set (here $H$ is taken as the negative second derivative of the
penalized log-likelihood, so that it is positive definite; $I$ is the
identity matrix)

$$H = \frac{\sum_{j=0}^{k} n_j}{n_0} \sum_{i=1}^{n_0} \frac{\exp(y_i^0 \beta_{old}^T x_i^0)}{\left(1 + \exp(y_i^0 \beta_{old}^T x_i^0)\right)^2} \, x_i^0 (x_i^0)^T + \lambda I.$$

The coefficient $\sum_{j=0}^{k} n_j / n_0$ is an adjustment, as the number
of samples used to obtain $H$ here is different from the number of
samples used to get the gradients below.

2: Compute the gradient $grad^j$ for each data set,

$$grad^j = noise^j + \sum_{i=1}^{n_j} \frac{y_i^j x_i^j}{1 + \exp(y_i^j \beta_{old}^T x_i^j)}, \quad j = 1, \ldots, k,$$

$$grad^0 = \sum_{i=1}^{n_0} \frac{y_i^0 x_i^0}{1 + \exp(y_i^0 \beta_{old}^T x_i^0)},$$

where $noise^j$, $j = 1, \ldots, k$, are i.i.d. vectors with density

$$p(noise^j) \propto \exp\left(-\frac{\epsilon_0 \|noise^j\|_2}{2M}\right).$$

3: Aggregate all the gradients

$$grad = \sum_{j=0}^{k} grad^j - \lambda \beta_{old}.$$

4: Output $\beta_{new} = \beta_{old} + H^{-1} grad$.

Algorithm 2

Distributed logistic regression

Input

Private data sets $D_1, \ldots, D_k$, public data set $D_0$, privacy
budget $\epsilon$, coefficient of penalty $\lambda$, the upper
bound of the L2 norm of samples $M$, and number of iterations $l$.

Output

Logistic regression parameter $\beta$.

1: $\epsilon_0 = \epsilon / l$

2: Initialize the logistic regression parameter $\beta = 0$, a
vector with the same length as $x$.

3: Given the data sets, $\epsilon_0$, $\lambda$ and $M$, use $\beta$ as $\beta_{old}$ and
update it with $\beta_{new}$ from Algorithm 1. Repeat $l$ times.

4: Output $\beta$
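
The two algorithms can be sketched together in NumPy. This is a minimal illustration under several assumptions that should be checked against the original article and Additional file 1: the Hessian is taken as the positive definite negative second derivative on public data, rescaled by the ratio of total to public sample counts; the update adds H^{-1} grad; and the noise with density proportional to exp(-eps0 * ||v||_2 / (2M)) is drawn using a uniform direction and a Gamma-distributed norm. All function names are hypothetical.

```python
import numpy as np


def sample_noise(p, eps0, M, rng):
    """Vector in R^p with density proportional to exp(-eps0 * ||v||_2 / (2 * M))."""
    direction = rng.standard_normal(p)
    direction /= np.linalg.norm(direction)
    return rng.gamma(shape=p, scale=2.0 * M / eps0) * direction


def local_gradient(beta, X, y):
    """Sum over samples of y_i * x_i / (1 + exp(y_i * x_i^T beta))."""
    margin = y * (X @ beta)
    s = 0.5 * (1.0 - np.tanh(margin / 2.0))  # equals 1 / (1 + exp(margin))
    return X.T @ (y * s)


def update_step(beta_old, private_sets, public_set, eps0, lam, M, rng):
    """One modified Newton update (sketch of Algorithm 1)."""
    X0, y0 = public_set
    p = X0.shape[1]
    n_total = X0.shape[0] + sum(X.shape[0] for X, _ in private_sets)
    # Step 1: Hessian from public data only, rescaled to the total sample size.
    margin0 = y0 * (X0 @ beta_old)
    s0 = 0.5 * (1.0 - np.tanh(margin0 / 2.0))
    w0 = s0 * (1.0 - s0)  # exp(m) / (1 + exp(m))^2
    H = (n_total / X0.shape[0]) * (X0.T * w0) @ X0 + lam * np.eye(p)
    # Step 2: exact gradient on public data, noisy gradients on private data.
    grad = local_gradient(beta_old, X0, y0)
    for X, y in private_sets:
        grad += local_gradient(beta_old, X, y) + sample_noise(p, eps0, M, rng)
    # Step 3: add the contribution of the penalty term.
    grad -= lam * beta_old
    # Step 4: Newton-type update with the public-data Hessian.
    return beta_old + np.linalg.solve(H, grad)


def distributed_logistic_regression(private_sets, public_set, epsilon,
                                    lam, M, n_iter, rng):
    """Fixed number of updates with an evenly split budget (sketch of Algorithm 2)."""
    eps0 = epsilon / n_iter
    beta = np.zeros(public_set[0].shape[1])
    for _ in range(n_iter):
        beta = update_step(beta, private_sets, public_set, eps0, lam, M, rng)
    return beta
```

Each data set is passed as a pair `(X, y)` with labels coded as +1/-1, and `M` must be a valid upper bound on the L2 norm of every row of every private `X` for the noise scale to match the sensitivity argument.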

Experiments and results 

We compare our algorithm with two baselines. The
first is the meta-analysis method. It first adds noise
from the distribution $p(noise^j) \propto \exp\left(-\frac{\epsilon \lambda \|noise^j\|_2}{2M}\right)$ to the
logistic regression parameters learned from each private
data set, where $M$ is the upper bound of the predictors' L2
norm. Then, it outputs the weighted average (by number of
samples in each data set) of these noisy parameters
learned from the $k$ private data sets. By weighting locally
learned differentially private parameters, this method is
similar to the method in Pathak, Rane, Raj et al. [19].
The only difference is that we share the outputs from each
private data set in a differentially private manner (rather
than transmitting encrypted partial local outputs). Our
second baseline is to train a logistic regression model 
with only public data and neglect the private data sets. 
Our evaluation metric is model discrimination, i.e., the 
Area Under the ROC curve (AUC). We will explore the 
performance of all three models using various parameter 
values. Unless explicitly illustrated in the Figures, default 
values for the following parameters are set as follows: 
number of private data sets (3), fraction of public data 
in the training set (2%), privacy budget $\epsilon$ (1), number of
iterations (2). The regularization strength $\lambda$ for all three
models is selected from the range $[10^{-2}, 10^{6}]$ to maximize
the expected AUC based on a 10-fold cross validation.
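
The aggregation step of the meta-analysis baseline can be sketched as follows. The noise scale 2M / (epsilon * lam) is an assumption chosen to be consistent with output perturbation for L2-regularized logistic regression [11]; the function names are hypothetical, and the locally fitted parameters are taken as given inputs.

```python
import numpy as np


def sample_output_noise(p, epsilon, lam, M, rng):
    """Vector in R^p with density proportional to
    exp(-epsilon * lam * ||v||_2 / (2 * M))."""
    direction = rng.standard_normal(p)
    direction /= np.linalg.norm(direction)
    return rng.gamma(shape=p, scale=2.0 * M / (epsilon * lam)) * direction


def meta_analysis(local_betas, sample_sizes, epsilon, lam, M, rng):
    """Average noisy local parameters, weighted by local sample size."""
    local_betas = np.asarray(local_betas, dtype=float)
    weights = np.asarray(sample_sizes, dtype=float)
    weights /= weights.sum()
    # Each site perturbs its own parameters before sharing them.
    noisy = np.array([b + sample_output_noise(b.size, epsilon, lam, M, rng)
                      for b in local_betas])
    return weights @ noisy
```

Because each private data set is disjoint from the others, every site in this sketch spends the full budget `epsilon` on its own output.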

We used clinical data to conduct experiments, where 
the public and private data sets are split randomly. Each 
private data set contains roughly the same number of 
observations. We used a 60%/40% split for training and 
testing in all experiments. 

Data sets 

We used three data sets, i.e., Schumacher's breast cancer 
[27], hospital discharge [28], and the SEER breast cancer 
[29]. A summary is shown in Table 1, which includes
data description, number of attributes, number of
samples, and the class label distribution. Table 2 lists the
attribute description for each data set, where numerical
attributes are indicated by "*", and non-binary
categorical attributes were converted into binary ones through
dummy coding. For example, a categorical attribute of c
categories will be converted into c - 1 binary covariates
in dummy coding (e.g., 0 -> (1,0), 1 -> (0,1) and 2 ->
(0,0) for the case c = 3). The class label attributes are
shown in the last row of Table 2.
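
The dummy coding described above can be sketched as a small helper. It assumes that categories are already encoded as integers 0 to c - 1 and that the last category is the all-zero reference level, as in the example for c = 3; the function name `dummy_code` is hypothetical.

```python
import numpy as np


def dummy_code(values, n_categories):
    """Map integer categories 0..c-1 to c-1 binary columns.
    The last category (c-1) is the all-zero reference level."""
    values = np.asarray(values)
    codes = np.zeros((values.size, n_categories - 1), dtype=int)
    rows = np.flatnonzero(values < n_categories - 1)
    codes[rows, values[rows]] = 1
    return codes


print(dummy_code([0, 1, 2], 3).tolist())  # [[1, 0], [0, 1], [0, 0]]
```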

As the magnitudes of attributes have large impact on the 
overall sensitivity of logistic regression parameters and the 
gradients of the log-likelihood function, we normalized all 
attributes and truncated their values to [-2,2] in order to 
bound the impact. Note that we used the mean and the 
variance of public data to conduct normalization, which 
does not incur a privacy cost. 
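
The preprocessing can be sketched as follows, assuming the data are held in NumPy arrays with one row per sample. Statistics are computed on the public data only and then applied to every data set. With each attribute truncated to [-2, 2], the L2 norm of a sample with p attributes is at most 2 * sqrt(p), which gives one valid (though possibly loose) value for the bound M. The function names are hypothetical.

```python
import numpy as np


def normalize_with_public_stats(X_public, X_other, clip=2.0):
    """Standardize with the public mean and standard deviation, then
    truncate every attribute to [-clip, clip]."""
    mean = X_public.mean(axis=0)
    std = X_public.std(axis=0)
    std = np.where(std > 0, std, 1.0)  # guard against constant columns

    def transform(X):
        return np.clip((X - mean) / std, -clip, clip)

    return transform(X_public), transform(X_other)


def l2_norm_bound(n_features, clip=2.0):
    """Upper bound on the L2 norm of a truncated sample."""
    return clip * np.sqrt(n_features)
```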

Results 

We first used the hospital discharge data set to explore 
the effect of different parameters on model discrimina- 
tion. Then, we evaluated and compared the model per- 
formance using all three datasets to check the impact of 
scalability. 

1) Model comparison using different parameters: In
Figure 1a, we illustrate the effect of the number of
private data sets on discrimination. The AUCs for our
method and for the meta-analysis method drop at the
same speed as the number of data sets increases, and our
method results in higher AUCs in all scenarios. This is
because both methods have to add noise to outputs from
each private data set. Given a fixed number of
observations (8,668), more noise is added when there are more
private data sets (evenly split). The public data based
method is not affected by this setting. (In the experiments
of this section, AUCs have small standard deviations
(around 0.05) and the mean of results from 100
independent experiments is stable enough to represent the
performance of the methods (the standard deviation of the mean
is around 0.005). Therefore, we only plot the mean AUCs.)

Figure 1b shows the AUCs of the models given
different fractions of public data. Our algorithm's
performance is stable, but the AUCs of the public data based
model grow quickly with increased sample size. When
the fraction of public data is between 1% and 5%, our
algorithm is the best.

Figure 1c shows how AUCs change at different
regularization strengths ($\lambda$ in Algorithm 2). Our method has the
most stable performance even when the regularization
strength increases 100 times, or when it is 10% of the
optimal value. This is very important in practice, as it is
expensive to calculate the best regularization strength (i.e.,
additional data must be reserved for tuning the
regularization strength, which also spends some privacy budget).
With the stable performance, we can hypothesize that
even if the guess is away from the optimal regularization
strength, our method can perform well in terms of
discrimination.

Table 1 Summary of data sets used in our experiments

| Data set | Data set description | # of attributes | # of samples | Class distribution (negative/positive) |
|---|---|---|---|---|
| 1 | German breast cancer | 9 | 686 | 43.6% / 56.4% |
| 2 | Hospital discharge | 17 | 8,668 | 4.4% / 95.6% |
| 3 | SEER breast cancer | 37 | 55,000 | 21.0% / 79.0% |

Figure 1d shows the effect of the privacy budget
($\epsilon$ in Algorithm 2) on AUCs. The method based on public
data is not affected, but the other two perform
better with a larger budget. Our algorithm is better than
the meta-analysis model in general, and outperforms the
public data based method when the privacy budget is
larger than 1.

Figure 1e shows how the number of iterations in the
distributed logistic regression algorithm ($l$ in Algorithm 2)
affects our algorithm's performance. When there are
no iterations (i.e., only locally calculated parameters are
used), our method degrades to the public data based
algorithm (the same starting point). Our performance is
best with fewer than 3 iterations, and it gets worse with
more iterations. There are several reasons for the
degraded performance. First, our algorithm uses
parameters trained on the public data set as the starting point,
which is expected to be close to the real one, and
therefore fewer iterations are necessary than in the
traditional Newton method (starting from all zeros).
Second, as noise is added in each iteration, the gain
from later iterations can be masked by the increased
amount of noise. Yet another reason is that our privacy
budget is evenly split across a pre-determined number of
iterations, and more iterations imply a larger amount of
total noise.

A natural question is when our method would be 
most useful. This is a hard question to answer in theory 
because our method has no guarantee of convergence 
(due to the added noise). However, we can answer the 
question empirically. 

We decompose parameters into "external factors" and 
"controllable factors". The former corresponds to: the 
number of private data sets, the fraction of data that are 
public, and the privacy budget (set by the data owner),
which researchers cannot control. The regularization 
strength and iteration numbers, however, are not exter- 
nal factors, as researchers can choose them. 

In Figure 2, we show how the three methods perform
(using their corresponding "best" controllable factors)
given different external factors, which include the
numbers of private data sets (1, 3, 5, 10 and 20), the
percentages of public data (0.5%, 1%, 2%, 5%, 10%) and
the privacy budgets (0.5, 1, 1.5, 2, 5). Each rectangle
represents a comparison between our method and the
best of the other two approaches. Red and yellow
rectangles indicate our algorithm is better, while green and
blue rectangles mean the opposite. Our algorithm
performs well under many situations, including: (1) privacy
budget equal to or larger than 1, (2) 5-10 private data sets,
(3) percentage of public data around 1-5%.

Table 2 Attribute description for each data set, where numerical attributes are indicated with "*", non-binary
categorical attributes were converted into binary representations through dummy coding and classification labels are
shown in the last row.

| Data set 1 | Data set 2 | Data set 3 |
|---|---|---|
| Hormonal therapy (1. Yes, 2. No) | Specimen (1. Blood, 2. Urine, 3. Sputum, 4. CSF) | Race (25 categories) |
| Age* | Specific days* | Age* |
| Menopausal status (1. Premenopausal, 2. Postmenopausal) | Day of the week for collection (1. Weekday, 2. Weekend) | Marital status (6 categories) |
| Tumor size* | Age* | Histology* |
| Tumor grade* (Levels I, II, III) | Day of the week for the final result (1. Weekday, 2. Weekend) | Number of nodes examined* |
| Number of positive nodes* | Gender (1. Male, 2. Female) | Number of positive nodes* |
| Recurrence free survival time* (in days) | Insurance (1. Medicare, 2. Medicaid, 3. Commercial, 4. Other) | Grade* |
| Progesterone receptor* | | Tumor size* |
| Estrogen receptor* | Race (1. White, 2. Black, 3. Asian, 4. Hispanic, 5. Unknown/declined) | ER status (4 categories) |
| Status indicator (Pos: Alive, Neg: Died) | Potential error (Pos: Not a potential follow-up error, Neg: A potential follow-up error) | Vital status recode (Pos: Alive, Neg: Died) |

Figure 1 Effect of different parameters on model discrimination. Note that in 1(e), only our model is affected by iteration numbers.

Figure 2 Comparison of three methods given different external factors, including the privacy budget. Red and yellow rectangles indicate our algorithm
is better, while green and blue rectangles mean the opposite. (Recovered panel labels: "Privacy Budget = 0.5", "1 private data set", "Number of private data sets".)

2) Model comparison using different datasets: Our last
study evaluated model performance using data of
different sizes. The three biomedical data sets used in this
experiment differ in the number of observations. All input
parameters are set to default values. Each experiment is
repeated 100 times to generate a boxplot.

In Figure 3, our algorithm is compared to the
meta-analysis model and the logistic regression model trained on
public data sets. In all three data sets, our algorithm
shows the best performance as indicated by the p-values
(the p-values are calculated using the pairwise one-sided
Student's t-test).

Figure 3 Boxplot comparisons of models using three different datasets. We use default parameter values as stated in the beginning of
this section. For each method, the five lines from bottom to top are the 2.5%, 25%, 50%, 75% and 97.5% quantiles of AUCs. The p-values of pairwise
t-tests on AUCs are also shown.

Discussion and conclusion

We demonstrated a novel approach that combines
public and private data sets to build a logistic regression
model in a distributed manner. Our approach shows a
performance advantage over two other approaches
under various conditions. There are still challenges in
using our approach in practice. For example, categorical
attribute values in private data sets may not appear in
public data. The simplest solution is to pre-process
private data sets by dropping values that rarely appear
(or do not appear) in public data and that seem
uncorrelated with the labels. This will not spend privacy budget
as we only use public data to guide the process. It may
improve a model's utility by removing some values and
reducing the number of attributes. The upper bound of
the L2 norm for the predictors therefore gets reduced, and
so does the scale of noise.

Another limitation of our method is that we assume that
all data sets follow the same (joint) distribution. However,
it is possible that some data sets have sample bias, but
may still follow the same learning rule. A solution for this
is to use gradient descent instead of the Newton method
in Algorithm 1. However, the gradient descent algorithm
usually needs more iterations to reach an accurate solution,
and consequently it may add more noise and therefore
generate less satisfactory outputs. In conclusion, we
propose a new algorithm to extend the differentially private
framework to real world scenarios in biomedical research,
where public and private data sets are available for
analysis. Hybrid approaches that rigorously protect private data
while leveraging public data to improve the utility show
great promise to achieve "the best of both worlds" (i.e.,
data privacy and usefulness).